In this paper, we propose an analytic model using a semi-markov process for parallel computers which provides hardware support for a cache coherence mechanism. The model proposed here, the Semi-markov Memory and Cache coherence Interference model, can be used for the performance prediction of cache coherence based parallel computers since it can be easily applied to descriptions of the waiting states due to network contention or memory interference of both normal data accesses and cache coherence requests. Conventional analytic models using stochastic processes to describe parallel computers have the problem of numerical explosion in the number of states necessary as the system size increases even for simple parallel computers without cache coherence mechanisms. The number of states required by constructing our proposing analytic model, however, does not depend on the system size but only on the kind of cache coherence protocol. For example, the number of states for the Synapse cache coherence protocol is only 20, as is described in this paper. Using the proposed analytic model, we investigate several comparative experiments with widely known simulation results. We found that there is only a 7.08% difference between the simulation and our analytic model, while our analytic model can predict the performance of a 1,024 processor system in the order of microseconds.
Andrew FLAVELL Yoshizo TAKAHASHI
We propose a new high-performance message router for k-ary n-cube multicomputer systems, called the Tokky router. The router utilizes a small number of queues at the outputs of its communication ports to allow fully adaptive routing, misrouting to prevent deadlocks and randomization to prevent livelock. Uncongeste network performance is improved by the inclusion of the packet expressway. Accurate models are developed to predict the switch and buffer performance of routers for varying radix and dimension and these models can be used in the design of routers for networks other than those investigated here. The simulated performance of the router exceeds that of published results for oblivious routers and is equal to or exceeds those reported for other adaptive routers. These performance predictions are especially encouraging when the simplicity of the control structures required to implement the router are taken into consideration.
We developed a parallel bordered-block-diagonal (BBD) matrix solution for parallel circuit simulation. In parallel circuit sumulation on a MIMD parallel computer, a circuit is partitioned into as many subcircuits as the processors of a parallel computer. Circuit partition produce a BBD matrix. In parallel BBD matrix solution, diagonal blocks are easily solved separately in each processor. It is difficult, however, to solve the interconnection (IC) submatrix of a BBD matrix effectively in parallel. To make matters worse, the more a circuit is partitioned into subcircuits for highly parallel circuit simulation, the larger the size of an IC submatrix becomes. From an examination, we found that an IC submatrix is more dense (about 30% of all entries are non-zeros) than a normal circuit matrix, and the non-zeros per row in an IC submatrix are almost constant with the number of subcircuits. To attain high-speed circuit simulation, we devised a data structure for BBD matrix processing and an approach to parallel BBD matrix solution. Our approach solves the IC submatrix in a BBD matrix as well as the diagonal blocks in parallel using all processors. In this approach, we allocate an IC submatrix in block-wise order rather than in dot-wise order onto all processors. Thus, we balance the processor perfomance with the communication capacity of a parallel computer system. When we changed the block size of IC submatrix allocation from dot-wise order to 88 block-wise order, the 88 block-wise order allocation almost halved the matrix solution time. The parallel simulation of a sample circuit with 3277 transistors was 16.6 times faster than a single processor when we used 49 processors.
Takanobu BABA Akehito GUNJI Yoshifumi IWAMOTO
A network-topology-independent static task allocation strategy has been designed and implemented for massively parallel computers. For mapping a task graph to a processor graph, this strategy evaluates several functions that represent some intuitively feasible properties or the graphs. They include the connectivity with the allocated nodes, distance from the median of a graph, connectivity with candidate nodes, and the number of candidate nodes within a distance. Several greedy strategies are defined to guide the mapping process, utilizing the indicated function values. An allocation system has been designed and implemented based on the allocation strategy. In experiments we have defined about 1000 nodes in task graphs with regular and irregular topologies, and the same order of processors with mesh, tree, and hypercube topologies. The results are summarized as follows. 1) The system can yield 4.0 times better total communication costs than an arbitrary allocation. 2) It is difficult to select a single strategy capable of providing the best solutions for a wide range of task-processor combinations. 3) Comparison with hypercube-topology-dependent research indicates that our topology-independent allocator produces better results than the dependent ones. 4) The order of computaion time of the allocator is experimentally proved to be O (n2) where n represents the number of tasks.
Three dimensional (3-D) optics offers potential advantages to the massively-parallel systems over electronics from the view point of information transfer. The purpose of this paper is to survey some aspects of the 3-D optical interconnection technology for the future massively-parallel computing systems. At first, the state-of-art of the current optoelectronic array devices to build the interconnection networks are described, with emphasis on those based on the semiconductor technology. Next, the principles, basic architectures, several examples of the 3-D optical interconnection systems in neural networks and multiprocessor systems are described. Finally, the issues that are needed to be solved for putting such technology into practical use are summarized.
Hideo KIKUCHI Takashi YUKAWA Kazumitsu MATSUZAWA Tsutomu ISHIKAWA
This paper discusses the design, implementation, and performance of a bus-connected multiprocessor, called Presto, for a Rete-based production system. To perform a match, which is a major phase of a production system, a Presto match scheme exploits the subnetworks that are separated by the top two-input nodes and the token flow control at these nodes. Since parallelism of a production system can only increase speed 10-fold, the aim is to do so efficiently on a low-cost, compact bus-connected multi-processor system without shared memory or cache memory. The Presto hardware consists of up to 10 processisng elements (PEs), each comprising a commercial microprocessor, 4 Mbytes of local memory, and two kinds of newly developed ASIC chips for memory control and bus control. Hierarchical system software is provided for developing interpreter programs. Measurement with 10 PEs shows that sample programs run 5-7 times faster.